ABSTRACT


Techniques  are  developed  for  the  design  of  a  monitor  of  a  real-time 
multi-computer  system  that  is  under  heavy  loading.  The  first  portion 
relates  to  the  requirements  of  partitioning  to  aid  in  fault  recognition 
and diagnostic routines. The second portion develops the dynamic
allocation of system time to the system tasks and to fault monitoring. System
reconfiguration  of  the  partitioned  subsystems  restores  the  system  to 
operation  at  a  degraded  level  until  faults  are  corrected.  The  paper 
discusses  a  Ship  Combat  Weapon  System  as  an  example  of  a  large  scale 
multi-computer  system  monitor. 

I.  INTRODUCTION

The monitoring of a small computer system is a relatively simple
task. As computer systems grow larger, the problem becomes much
greater. When the monitoring of a large scale Real-Time
Multi-Computer system is attempted, many problems arise. One must know
which program is running in which computer and how each program affects
the mass of data that flows between the many input/output ports of the
system. When the system is running near capacity in all processors
(heavily loaded), the problem becomes even greater, because little
time is left to process data for monitoring. Monitoring must therefore
be done under very severe timing and space
utilization constraints. This thesis provides the necessary tools
for calculating timing and core space requirements.

The five major portions of a fault monitoring system are
(1) Partitioning of the total system into small subsystems, (2) Dynamic
time allocation to better utilize remaining time for fault monitoring
procedures, (3) Fault recognition techniques, (4) Diagnostic routines,
and (5) System reconfiguration to automatically restore system operation.
All five of these operations are discussed and analyzed so that
the designer may apply the correct techniques to the system at hand to obtain
an effective monitor.

One example of a Real-Time Multi-Computer system is a Ship Combat
Weapon System. The general aspects of a Ship Combat Weapon System are
discussed to show how a general system compares to a specific
one. A simulated Ship Combat Weapon System is shown to contain all
the necessary parts of the general system and is analyzed in great
detail. The methods of monitoring this simulated Ship Combat Weapon
System are analyzed to develop the necessary monitoring techniques.

A detailed bibliography is presented that covers the area of
Monitoring  of  a  Real-Time  Multi-Computer  System.  A  cross  reference  of 
the  methods  of  monitoring  is  also  given. 

Proper system time utilization shows that much of the system
upkeep and maintenance that normally is accomplished during designated
maintenance periods may now be performed on-line with little or no
system degradation.


II.  BASIC  ELEMENTS  OF  A  FAULT  MONITORING  SYSTEM 

Fault detection and recognition is the most important maintenance
function in a Real-Time System. The Fault Recognition program
basically tests the processing integrity of the system. This program
requests that the subsystem found faulty or suspected of being faulty
be diagnosed. The purpose of the diagnostic program is to generate
test data to isolate the fault to a reasonably small section within
the subsystem.

Real-time  fault  detection  historically  has  been  done  at  the  circuit 
hardware  level.  In  the  early  stages  of  development  of  fault-tolerant 
computers,  attention  was  directed  towards  massive  redundancy  at  the 
lowest  level  -  the  replication  of  individual  components  (resistors, 
transistors,  etc.).  The  use  of  component  redundancy  has  been  limited 
by design difficulties and by new developments in component technology.
The change from discrete components to integrated circuits has largely
invalidated the assumption of independent component failures. Without
that assumption, the advantages of component redundancy are lost.

The  most  developed  techniques  are  fault  detection  by  periodic 
diagnosis  and  the  application  of  parity  and  similar  error  codes  to 
detect  or  correct  errors  in  data  transmission  and  storage.  The 
periodic  diagnosis  techniques  have  progressed  from  exclusively  software 
implementations  to  software  combinations  with  special-purpose  hardware. 
Concurrent  diagnosis  uses  error  detecting  codes  and  monitoring  circuits. 


A.  HARDWARE 

Fault detection and diagnosis by hardware have greatly increased
the sensitivity and selectivity of finding and correcting errors. In
the early days, fault detection systems utilized registers that read
data at specific times into an output device. Later, specially
built hardware devices set off alarms when an error was detected.
The circuit that produced the fault was located by utilizing a book
of manual diagnosis procedures. As designers progressed, circuits
were designed not only to send interrupts to the computer notifying
it of an error, but also to allow the diagnostic routine programmatic
access to the error register.

To  assist  in  locating  faults,  the  hardware  system  may  be  partitioned 
into  logical  subelements  that  allow  reconfiguration.  Accurate  timing 
of  events  requires  a  real  time  clock  in  the  system.  Transient  or 
permanent  faults  may  be  initially  detected  by  hardware  devices  but 
efficient  identification  and  location  of  the  faults  requires  software 
diagnostic  routines.  Diagnostic  routines  are  built  upon  the  concept  of 
detecting  faults  by  executing  one  instruction  at  a  time.  As  the 
instructions become more complicated, more circuitry (microsteps) is
analyzed for faults. This is repeated until all elements are shown
to be fault free or a fault is located.

Detecting an error effectively and in a timely
manner requires hardware circuits. Programmatic access to error
registers  allows  greater  flexibility  and  speed  in  diagnosing  the 
actual  fault. 


B.   SOFTWARE 

Many studies and investigations have been made in the area of
fault-detection and diagnosis by software. Even the earliest computer
systems had diagnostic programs to check the computers for errors.
As computers became more complex, the size of diagnostic routines
increased, as did the time required to write them (in man years).
By combining fault-detection with diagnostic routines, the total time
to locate an error was reduced. By allowing periodic maintenance
checks to be performed using this combined method, large computer
systems reduced their amount of down time. By combining
fault-detection and diagnostic routines with automatic system
reorganization, the down time may be reduced to a minimum with imposed
system degradation. [4,14,15]

Software  must  also  be  partitioned  into  logical  subelements  to 
allow  for  program  relocation  or  reconfiguration.  Knowing  when  to 
reconfigure  requires  monitoring  the  most  critical  data  of  each  logical 
subelement  program.  When  critical  data  of  a  program  are  detected  as 
being  faulty,  then  the  program  itself  is  faulty. 

When  the  combined  technique  of  fault-detection,  diagnostic  routines 
and  automatic  system  reorganization  is  used  in  a  large  system, 
additional  problems  occur.  Finding  time  to  run  the  required  tests 
is  a  problem  in  a  heavily  loaded  system.  If  there  is  barely  enough 
time  to  complete  the  required  tasks,  how  can  we  allow  extra  time  for 
maintenance  tests?  The  answer  implies  some  type  of  dynamic  time 
allocation.  Another  problem  is  the  manner  of  presenting  this  detailed 
data  to  the  system  monitor  operator  in  a  timely  manner.  The  operator 
must  have  enough  data,  but  in  a  short  time,  to  allow  him  to  complete 
the  action  required  of  him  before  the  total  system  fails. 


III.  DEFINITIONS AND CONCEPTS OF FAULT MONITORING
IN A MULTI-PROCESSING SYSTEM


Large and complex computer systems increase the demand for fault
monitoring systems. Because of this complexity, a human working unaided
requires too much time to locate and correct faults. The cost of uncorrected
errors is especially severe in large multi-computer systems and in
situations in which a computer controls a very valuable system and is
not readily accessible for human repair. Examples are a real-time
control computer and a spacecraft computer controlling an
interplanetary mission. A second critical requirement for fault monitoring
exists when human lives may be affected by computer errors (e.g.,
military defense systems, high-speed transportation control systems,
or medical systems). The time to repair such complex systems must
be reduced, and fault monitoring is one approach.

The five necessary parts of fault monitoring are: (1) Partitioning,
(2) Dynamic time allocation, (3) Fault recognition, (4) Diagnostic
routines, and (5) System reconfiguration. These must be considered in
detail so that together they yield an efficient
system monitor. The following sections describe the necessary
elements of a fault monitoring multi-computer system. Each part of
fault monitoring is first defined in detail; some
explicit uses of each part in fault monitoring are then given.

A.  PARTITIONING 

Partitioning is the process by which a large complicated system
is divided into logical subelements. Each subelement has a specific
function in either hardware or software. Because each subelement
is required to be a logical subdivision, it may be replaced in case of a
detected failure. An example is a program subroutine that
is located in faulty computer memory core; it may be relocated to a
fault free core location. Another example is a detected fault in an
input/output (I/O) channel; the monitoring system would reconfigure
the system input/outputs so as to utilize another channel. If a
Central Processing Unit (CPU) failed, its tasks (processing of logical
program subelements) could be allocated to other CPU's in a reduced
operating mode.

The most important requirement for successful partitioning is to
segment the system into logical elements, each having the capability
of being relocated by the system reconfiguration module.
Thus it is required that core memories, for example, be divided into
modules, regardless of the actual computer memory organization. The
hardware items could be partitioned into CPU's, core memory modules and
input/output channels. Thus when one hardware item fails and another
similar hardware unit is free (or only partly utilized), it may be
used immediately by the system monitoring reconfiguration program. In
a similar manner, all computer software programs and subprograms
should be partitioned into logical units of approximately equal core
size so that immediate reconfiguration may be performed. Programs
of unequal size would create the problem of moving all programs in
the computer. If a display device should fail and the display processor
program becomes idle, the system could be reconfigured to use the
teletype (or some other output device) along with the teletype
processing module.
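
The benefit of equal-size partitions can be illustrated with a minimal
sketch. It assumes a hypothetical model in which core memory is a set of
numbered, equal-size modules, each holding at most one program unit; the
names CoreModule and relocate, and the program names, are illustrative only
and are not taken from the simulated system.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CoreModule:
    """One fixed-size block of core memory (hypothetical model)."""
    faulty: bool = False
    program: Optional[str] = None  # name of the program unit loaded here


def relocate(modules: Dict[int, CoreModule], bad_id: int) -> Optional[int]:
    """Mark module bad_id faulty and move its program to a spare module.

    Returns the id of the module that received the program, or None when
    nothing was loaded or no free, fault-free module exists.
    """
    bad = modules[bad_id]
    bad.faulty = True
    if bad.program is None:
        return None  # nothing was loaded, so nothing has to move
    for mod_id, mod in sorted(modules.items()):
        if not mod.faulty and mod.program is None:
            # Equal-size units mean no other program has to be shifted.
            mod.program, bad.program = bad.program, None
            return mod_id
    return None  # no spare: the program stays where it was


core = {0: CoreModule(program="radar"),
        1: CoreModule(program="ship"),
        2: CoreModule()}
new_home = relocate(core, 0)
```

Because every unit fits any module, a single move restores the program;
with unequal sizes the search would have to compact or reshuffle memory.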

B.  DYNAMIC TIME ALLOCATION

Time slicing is the division of the total computer system time
available. This system time is allocated to all of its component
parts and programs. Time slicing is used to minimize the
time required for fault detection and diagnosis in any one time frame.
During routine system operations when the system is lightly loaded, there
are large blocks of time available in each executive cycle for fault
recognition and diagnostic analysis. As demand on the system
grows, the time available for analysis is shortened. In order to
utilize this time more effectively, the fault monitoring program must
dynamically allocate the available time. Thus the program must know
how much time is available for use and thus how much diagnosis may be
performed in this specific cycle. Flags (or some other method) must
be set and the proper bookkeeping performed to ensure that the most
critical data is still monitored and analyzed during the most heavily
loaded period of system utilization. During lull periods, the
monitoring program must also ensure that all components of the
computer system are analyzed so as to verify complete system
integrity. [12]

It  follows  directly  that  fault  detection  routines  must  be  timed 
and  must  be  able  to  operate  under  flag  (executive)  control.  By  the 
proper  allocation  of  these  routines  to  the  pertinent  tasks  at  hand, 
all  criteria  may  be  satisfied.  Strict  timing  control  of  the  main 
and monitoring program must be performed and must be programmatically
available. 

By dynamically controlling the time used for fault monitoring, a
wide range of operational modes may be effectively monitored. These
modes may vary from lightly loaded systems to very heavily loaded
systems. In a lightly loaded system almost all operations are
concentrated on fault monitoring, diagnostic analysis and maintenance.
Under heavy loading, the time for monitoring is very small. In most
systems, it is zero except for hardware fault monitors. By dynamically
allocating a small segment of time to special purpose fault monitoring
routines, increased reliability may be gained with little system
interference.
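
One possible form of dynamic time allocation is sketched here under
stated assumptions: each monitoring test has a known, pre-timed duration
and a priority, and the executive reports how much of the cycle the main
tasks consumed. The function name, the test names and all timing figures
are hypothetical.

```python
from typing import List, Tuple

# (name, duration in ms, priority); a lower number is more critical.
# All values are illustrative, not measured.
Test = Tuple[str, float, int]


def allocate_tests(cycle_ms: float, task_ms: float,
                   tests: List[Test]) -> List[str]:
    """Pick the monitoring tests that fit in the spare time of one cycle."""
    spare = max(0.0, cycle_ms - task_ms)
    chosen = []
    # Most critical tests are considered first, so they survive heavy load.
    for name, duration, _priority in sorted(tests, key=lambda t: t[2]):
        if duration <= spare:
            chosen.append(name)
            spare -= duration
    return chosen


tests = [("memory_scan", 8.0, 3),
         ("critical_data_check", 1.0, 1),
         ("io_loopback", 3.0, 2)]
light = allocate_tests(50.0, 20.0, tests)   # lightly loaded cycle
heavy = allocate_tests(50.0, 48.5, tests)   # heavily loaded cycle
```

In the lightly loaded cycle all three tests run; in the heavily loaded
cycle only the critical data check fits, which is the behavior the text
requires of the bookkeeping.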

C.   FAULT  RECOGNITION 

Fault detection in digital computers is implemented either by
periodic or by concurrent diagnosis. The most common current approach
is periodic diagnosis, which utilizes a diagnostic program stored in
the computer memory. Computation is periodically interrupted
and the diagnostic program is executed. The diagnostic program itself
is vulnerable to faults in the memory system. The cost of diagnosis
consists of: (1) the storage used for the diagnostic program, (2) the
time consumed by its execution, (3) the time needed for repair, and (4) the
repeated execution (rollback) of the program segment which was run
after the last diagnosis. [14] Such time and storage costs are very
severe in real-time computing. The alternate diagnosis method is
concurrent diagnosis, in which error-detecting codes and monitoring
circuits are employed to indicate the presence of faults.

A distinction must be made in fault detection between transient
and permanent errors. By maintaining a history of detected errors
with no diagnosable faults, a trend of transient errors may be stored.
This trend may be utilized to determine an impending major fault. In
critical locations, special hardware devices must be installed to
detect errors that would remain undetected or unrecoverable by software
diagnosis. For example, the current instruction address, in the
location counter, may be required to locate a fault. If the location
counter is not programmatically available concurrent with the fault
detection, then a special fault location register is required. This
register would automatically copy the contents of the location counter
at the time of any fault. [4,14]
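
The history of transient errors might be kept as a sliding window over
recent executive cycles. The following sketch is only one way to do so;
the class name, the window length and the threshold are assumptions chosen
for illustration.

```python
from collections import deque


class TransientHistory:
    """Sliding window of recent cycles, used to spot a trend of transients."""

    def __init__(self, window: int = 10, threshold: int = 3):
        # True means an error was detected in that cycle but no fault
        # could be diagnosed. Old entries fall out automatically.
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, error_seen: bool) -> bool:
        """Log one cycle; return True when the trend suggests a coming fault."""
        self.events.append(error_seen)
        return sum(self.events) >= self.threshold


history = TransientHistory(window=5, threshold=3)
observed = [False, True, False, True, True, False, False, False]
warnings = [history.record(e) for e in observed]
```

The warning is raised once three transients fall within five cycles and
clears again as the older errors leave the window.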

Faults may originate from either hardware or software. They may
also be detected by hardware, software or a combination of both.
A  list  of  all  the  pertinent  errors  to  be  detected  is  required.  From 
this  list  a  division  must  be  made  between  the  two  types  of  fault 
origin,  hardware  or  software.  This  decision  is  influenced  by  the 
method  of  fault  detection.  This  list  is  used  in  the  process  of 
determining  partitions.  Each  fault  detection  technique  used  depends 
upon  many  system  factors  that  must  be  taken  into  account.  How  the 
system  is  partitioned  affects  the  grouping  of  faults.  This  grouping 
of  faults  is  used  by  the  dynamic  time  allocation  routine. 

D.  DIAGNOSTIC ROUTINES

The diagnostic program is designed to isolate and specify errors
in main-frame arithmetic and control logic, the various information
transfers, and the various devices and registers. The program is
constructed on the general basis that every command in the machine
repertoire uses a unique set of microinstructions or microsteps. If one
command leads to a correct result and a second command using the same set
plus one leads to an incorrect result, then the failure is assumed to
be in the additional microstep. The detailed diagnosis of errors
requires that the possibilities of control signal failure,
transmission path failure, and register failure be investigated. It is
often difficult to separate these types. [4,5,14]
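
The "same set plus one" rule can be expressed as a set difference. The
sketch below assumes that the microsteps used by each command are known
and that each command has already been run and judged pass or fail; the
command and microstep names are invented for the example.

```python
from typing import Dict, FrozenSet, Optional


def isolate_microstep(commands: Dict[str, FrozenSet[str]],
                      results: Dict[str, bool]) -> Optional[str]:
    """Return the one microstep blamed for a failure, or None.

    A failing command is compared with every passing command. If the
    passing command's microsteps are a subset of the failing command's and
    exactly one microstep is extra, that microstep is assumed faulty.
    """
    passing = [steps for name, steps in commands.items() if results[name]]
    for name, steps in commands.items():
        if results[name]:
            continue  # only failing commands are examined
        for good in passing:
            extra = steps - good
            if good <= steps and len(extra) == 1:
                return next(iter(extra))
    return None  # the failure could not be narrowed to one microstep


commands = {
    "LOAD": frozenset({"fetch", "read"}),
    "ADD": frozenset({"fetch", "read", "alu_add"}),
}
results = {"LOAD": True, "ADD": False}
suspect = isolate_microstep(commands, results)
```

When no passing command differs from the failing one by a single
microstep, the sketch returns None, which corresponds to the difficulty of
separating failure types noted above.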

For a large scale real-time multi-computer system, fault diagnosis
routines become even more complex and time consuming than the diagnostic
routine just described. To eliminate these problems, a modularized
systems approach must be utilized. Rather than be concerned with a
specific circuit element failing, we concentrate on detecting modular
subsystem errors. Critical data are monitored so as to indicate
failures in any one of our subsystem modules. Upon confirming a
permanent error in a module, the system reconfiguration program is
called. [14]

E.  SYSTEM RECONFIGURATION

When permanent faults are detected and analyzed in a complex
system, the total system may halt or the system may be reconfigured
to avoid the faulty component (partitioned submodule) and operated at
a reduced level. As discussed before, halting a valuable
system is not acceptable. Reconfiguration may be manual or automatic.
In the automatic mode, the computer system must maintain a current
configuration list of all submodules and their operational status.
Upon notification that a submodule that is part of the operational
system has failed, reconfiguration is forced. When a submodule is repaired and
proven operational, the system monitor operator may indicate this fact
to the program and then request reconfiguration. In the manual
mode, the system must be examined manually, the new configuration
determined and then manually implemented.

The reconfiguration program contains the necessary information to
logically interconnect all submodules and to relocate computer programs
when necessary. By displaying the proposed reconfigured computer
system to the system monitor operator, approval may be given and the
new system implemented. Automatic reconfiguration could save as much
as 99 percent of the time required for manual reconfiguration.

While automatic reconfiguration may appear to complicate
the problem, it is in reality a simplification. To
manually maintain the configuration control of a large multi-computer
system is a large team effort. Many charts, manuals, and switches
must be coordinated with exacting precision. Automation of this task
reduces this problem. The system configuration is maintained up to
date in the computer memory. Switching and logical control of the
input/output ports are controlled by computer subprograms. When
these aids are implemented in a reconfiguration module program, the
time required to change a configuration is reduced to seconds.
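
A configuration list held in memory could be as simple as a status table
plus an ordered list of alternates for each submodule. The sketch below
assumes such a layout; the function name and the submodule and task names
are hypothetical, with the display and teletype pair borrowed from the
partitioning discussion.

```python
from typing import Dict, List


def reconfigure(status: Dict[str, bool],
                assignment: Dict[str, str],
                spares: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a new task-to-submodule assignment that avoids failed units.

    status maps each submodule to True when operational. spares lists the
    alternates to try, in order, for a given submodule. A task with no
    working unit is dropped, which models operation at a reduced level.
    """
    new_assignment = {}
    for task, module in assignment.items():
        candidates = [module] + spares.get(module, [])
        working = [m for m in candidates if status.get(m, False)]
        if working:
            new_assignment[task] = working[0]
    return new_assignment


status = {"display": False, "teletype": True, "io_ch1": True}
assignment = {"show_tracks": "display", "sensor_input": "io_ch1"}
spares = {"display": ["teletype"]}
proposed = reconfigure(status, assignment, spares)
```

The returned table is only a proposal; in the scheme described above it
would be displayed to the system monitor operator for approval before it
is implemented.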

Because  of  the  hypercritical  nature  of  the  fault  monitoring  process, 
special  precautions  must  be  observed.  The  fault  monitoring  program 
can  be  duplicated  in  another  computer  or  reside  in  a  special  fault 
tolerant  computer.  These  precautions  reduce  the  danger  of  a  fault 
occurring  during  execution  of  the  reconfiguration  program. 


IV.  A SPECIFIC EXAMPLE: A SHIP COMBAT
WEAPON SYSTEM


One example of a complicated real-time multi-computer system is
a ship combat weapon system. Many computers are utilized to solve
special subsystem tasks, such as target detection and missile firing.
The total computer complex is integrated by the Combat
Information Center System. In this real-time system, many control
and data processing functions must be performed at extremely high
speeds. Most of the sensory and control devices must be electrically
connected on-line to the system to permit automatic transfer of data.
Delays in transferring data by means of manual off-line handling of
tapes, cards, etc., are not acceptable.

An  executive  control  philosophy  was  developed  for  distributing 
the  various  tasks  among  the  computers.  Control  of  these  tasks  in 
each  computer  is  maintained  by  an  Executive  Routine.  Over-all  control 
of the multi-computer system is not employed since each computer is
controlled  by  its  own  Executive  Routine.  Subroutines  or  tasks  are 
controlled  by  the  Executive  Routine  by  the  use  of  flags  or  alerts. 
Decisions  on  whether  to  respond  to  the  flag  or  alert  at  any  specific 
time  are  determined  by  the  priority  of  the  input  and  the  time 
available  to  do  the  task.  This  accurate  timing  is  made  possible 
by  use  of  an  internal  real-time  clock.  When  each  task  is  completed, 
the  flags  and  alerts  are  sensed  again  and  the  highest  priority  task 
remaining  is  performed.  Both  periodic  and  demand  type  tasks  are 
utilized  by  executive  routines. 
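
The flag-driven control loop of an Executive Routine can be sketched as
follows. This is an illustration under assumptions, not the routine used
in the system: task durations are fixed estimates that stand in for
readings of the real-time clock, and the task names are invented.

```python
from typing import Callable, Dict, List, Tuple

# name -> (priority, estimated duration, routine); a lower priority
# number is more urgent.
TaskTable = Dict[str, Tuple[int, float, Callable[[], None]]]


def run_executive(flags: Dict[str, bool], tasks: TaskTable,
                  time_left: float) -> List[str]:
    """Service flagged tasks, highest priority first, while time remains."""
    done = []
    while True:
        # Sense the flags again after every completed task.
        ready = [(prio, name) for name, (prio, dur, _) in tasks.items()
                 if flags.get(name) and dur <= time_left]
        if not ready:
            return done
        _, name = min(ready)
        _prio, duration, routine = tasks[name]
        routine()
        flags[name] = False   # clear the flag once the task is serviced
        time_left -= duration
        done.append(name)


log: List[str] = []
tasks = {
    "track_update": (1, 2.0, lambda: log.append("track")),
    "display_refresh": (2, 3.0, lambda: log.append("display")),
    "fault_monitor": (3, 4.0, lambda: log.append("monitor")),
}
flags = {"track_update": True, "display_refresh": True, "fault_monitor": True}
order = run_executive(flags, tasks, time_left=6.0)
```

With 6.0 time units available, the two higher priority tasks run and the
fault monitor flag stays set for a later cycle, which is the situation
that motivates dynamic time allocation.
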

A Ship Combat Weapon System is an example of a large real-time
multi-computer system in actual use. This system is studied in detail
to develop the methodology of Fault Monitoring techniques.
The methods evolved for Fault Monitoring in a Ship Combat Weapon
System will apply to most large real-time multi-computer systems.

A.  PROPERTIES OF A COMBAT WEAPON SYSTEM AND THE DATA INVOLVED

A typical Ship Combat System consists of many interconnected
systems (see fig. 1). They are divided into three major areas:
(1) Input, (2) Processing and (3) Output. Data from the various
input devices are analyzed by the input processors and the relevant
data transmitted to the processing section. The processing section
correlates the data from the different sensors. The correlated data is
transferred to the appropriate output device (Guns, Missiles).

Each of these three areas has in turn many components, some of
which are:

1.  Inputs 

A.  Radar  video 

B.  Sonar  input 

C.  IFF  input 

D.  Ship  information 

2.  Processing  (Combat  System  Decisions) 

A.  Three  processing  computers 

B.  Five  types  of  display  consoles 

(1)  Radar 

(2)  Sonar 

(3)  CIC  (Tactical) 




(4)  Weapons 

(5)  Monitor 

3.  Outputs

A.  Missile Fire Control System

(1)  Tracking  mount 

(2)  Missile  mount 

B.  Gun  Fire  Control  System 

(1)  Tracking  mount 

(2)  Gun  mount 

C.  Sonar  Fire  Control  System  and  Torpedo  Mount 

B.  THE SIMULATED COMBAT WEAPON SYSTEM

The Ship Combat Weapon System described in figure 1 is very
complicated. To analyze this system in detail for Fault Monitoring
purposes is a large task. A representative system will be simulated
instead (see fig. 2). The simulation will contain Input, Processing
and Output sections. By including one system with each function, a
representative but reasonably sized, heavily loaded system may still
be simulated.

The Naval Postgraduate School has the unique computational
facilities of a large system simulation laboratory with three digital
computers and one analog computer: the Xerox Data Systems (XDS) 9300
medium scale digital computer, two Adage Graphic Terminals AGT-10,
and the Comcor CI-5000 analog computer. Each digital computer is
assigned a major task of the Ship Combat Weapon System while the
analog computer (CI-5000) simulates the physical missile mount. The
physical identity of the computers, the system functions and the
computer programs may be identified in figure 3.




The  Adage  Graphic  Terminal  computer  number  one  (Adage  1  computer) 
contains  the  Radar,  the  Ship  processor  and  the  Monitor  programs. 
Adage  Graphic  Terminal  computer  number  two  (Adage  2  computer)  contains 
the  Combat  System  Decisions  program.  The  XDS-9300  computer  contains 
the  Missile  Fire  Control  System  Program  and  the  CI-5000  operates  as 
the  Missile  Mount  Simulation. 

1.  Types  of  Data  Involved 

Some typical types of data involved in a Ship Combat Weapon
System are shown in table 1. Each of the five sections of table 1
is typical of the data that each system would contain. This data
is utilized by the Fault Monitoring program to analyze the Combat
System for the detection of errors. The Fault Monitoring program
will in turn pass this data to the Diagnostic program for further
analysis and evaluation.

The following types of data may be sampled for error detection
and fault location; a large jump in the data indicates an error.

1.  Radar azimuth

2.  Target range and bearing

3.  Gyro position, azimuth, pitch, and roll

4.  Speed

5.  Intercept point

6.  Time to fire

7.  Time to go

8.  Launcher angle ordered

9.  Launcher limits, data sample rate and bearing

10. Missile mount bearing and elevation

Normally, this data would be a continuous stream.

Table 1
Types of Data in the Simulated System

1.  Radar Video & Processor
    A.  Azimuth
    B.  Target Data
        1.  Range
        2.  Bearing
2.  Ship Information & Processor
    A.  Gyro
        1.  Position (Lat., Long.)
        2.  Azimuth (Heading)
        3.  Pitch, Roll
    B.  Speed
3.  Combat System Decisions
    A.  Computer Status
        1.  Memory Available
        2.  I/O Channels available
        3.  CPU's available
    B.  Target Data (Speed, Heading)
    C.  Total System Configuration
4.  Missile Fire Control System
    A.  Intercept Point
    B.  Time to Fire
    C.  Time to Go
    D.  Target Destruction Evaluation
    E.  Launcher Angle Ordered
    F.  Launcher Data Sample Rate
    G.  Launcher Bearing Limits
5.  Missile Mount
    A.  Bearing
    B.  Elevation
    C.  Operational Status
        1.  Errors in Bearing & Elevation
        2.  Drift Rates

Another method of error detection is by the analysis of data
outside of some physical limits:

1.  Angles greater than 360° or negative angles

2.  Target data outside of radar range or negative range

3.  Pitch greater than ±20°

4.  Roll greater than ±90°

5.  Ship speed greater than ±100 knots

6.  Acceleration greater than reasonable limits

A  few  general  rules  may  be  given  to  assist  in  the  detection 
of  faults. 

1.  Monitor data normally assumed to be continuous and smooth.
When  large  deviations  are  detected,  an  error  has  occurred. 

2.  Monitor  physical  data  and  check  for  data  outside  physical 
limitations,  i.e.,  ships  moving  faster  than  100  knots. 

3.  Send  test  data  to  software  subroutines  and  analyze  the 
results. 

4.  Send  test  instructions  to  the  computers  to  check  for  central 
processing  errors. 
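
The first two rules lend themselves to very small routines, which matters
when monitoring time is scarce. The sketch below is illustrative only: the
function names and the jump tolerance are assumptions, while the pitch and
speed limits are the ones listed above. Wrap-around of angles at 360° is
deliberately not handled.

```python
from typing import List, Optional


def out_of_limits(value: float, low: float, high: float) -> bool:
    """Rule 2: True when a physical quantity lies outside its limits."""
    return not (low <= value <= high)


def first_jump(samples: List[float], max_step: float) -> Optional[int]:
    """Rule 1: index of the first sample that jumps by more than max_step.

    Returns None when the data stream is smooth within the tolerance.
    """
    for i in range(1, len(samples)):
        if abs(samples[i] - samples[i - 1]) > max_step:
            return i
    return None


speed_fault = out_of_limits(120.0, 0.0, 100.0)    # ship speed in knots
pitch_fault = out_of_limits(12.0, -20.0, 20.0)    # pitch in degrees
azimuth = [10.0, 12.0, 14.0, 95.0, 97.0]          # degrees per sample
jump_at = first_jump(azimuth, max_step=5.0)
```

Here the 120 knot speed is flagged, the 12° pitch is accepted, and the
azimuth stream is flagged at the sample where it leaps from 14° to 95°.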

The program interaction of the Ship Combat Weapon System is
described in figure 4. Note the generalized use of the I/O package for
data exchange between programs. This general use of a common program
simplifies the automatic reconfiguration program. Since all programs
have the same requirements of input and output, data handling
is the same. When data from one program looks the same to another
program, program relocation is simplified. A program may be
moved from one computer to another without a change to the I/O package
program. Data sampling and fault monitoring also become simpler for the
same reason.

2.  Partitioning

a.  Hardware

The hardware items of the simulated Ship Combat Weapon
System were partitioned in section III into four major sections and
many subsections. The four major computer subsystems have become
the main hardware partitions (see fig. 5). Each digital computer is
composed of subsections normally ascribed to digital computers: Central
Processor Units (CPU), magnetic core, display units and input/output
channels. Some items like Radar, Monitor and input data have been
simulated with software routines for lack of the actual hardware
devices. Because of the similarity of program size and overall program
action, partitioning by subsystems is a reasonable choice. By
monitoring critical data within these subprograms, fault monitoring
of a subsystem becomes simpler than monitoring the system in total. If
any critical data from the radar subsystem is detected as erroneous,
the direct assumption by the diagnostic routine is that the radar
subsystem is at fault.

The analog computer contains a simulation of the Missile
Mount; therefore, the actual gear train and motor systems are
simulated. The logic control and I/O control are hardware-accessible
devices.

Note the similarity of the hardware items among the computer
systems. This allows a more direct method for fault detection and
analysis. Since all three digital computers have central processor
units and all have modular computer core memories (or simulated ones),
they are similar for hardware partitioning. One module of a
computer core memory could be used to replace another that contains
faults. The replacement could be in the same computer or in alternate
computers. Automatic reconfiguration is possible with similar
interchangeable subsystems.

Table  2  describes  the  critical  data  points  in  the  Combat 
System  used  to  detect  the  various  hardware  errors.  With  data  monitoring 
of  these  hardware  devices,  any  error  may  be  analysed  to  determine  the 
actual  device  at  fault. 
b.  Software 

Programs have been partitioned into the same four major
partitions as the hardware partitioning. Each subsystem function,
such as Radar or Ship Information, is used to separate major partitions
(see fig. 4). The detailed partitions are different tasks within each
subsystem function, such as Input data, Simulation data or Tracking.
Because each partitioned programming task is of approximately the
same magnitude, relocation is greatly simplified. Upon software
program reconfiguration, the first step will be to reload the
subprogram at fault into the same computer in which it first faulted.
If this fails, the program will be reloaded into another computer
with available space. Software reconfiguration will be completed
faster this way than by reloading the total system.
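
The two-step reload order described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions: `try_load` is a hypothetical callback standing in for the actual loader (it returns True when the subprogram runs correctly after loading), and the table of free core per computer is invented for the example.

```python
def reload_subprogram(size_words, home, free_words, try_load):
    """Return the computer that ends up hosting the subprogram, or None.

    size_words -- core needed by the faulting subprogram
    home       -- computer in which the subprogram first faulted
    free_words -- mapping of computer name to free core (hypothetical data)
    try_load   -- hypothetical loader callback: try_load(computer) -> bool
    """
    # Step 1: reload into the computer where the fault first appeared.
    if try_load(home):
        return home
    # Step 2: fall back to any other computer with enough available space.
    for computer, free in free_words.items():
        if computer != home and free >= size_words and try_load(computer):
            return computer
    return None


if __name__ == "__main__":
    free = {"Adage 1": 500, "Adage 2": 4000, "XDS-9300": 8000}
    # Pretend the reload keeps failing in Adage 1 but succeeds elsewhere.
    print(reload_subprogram(2000, "Adage 1", free, lambda c: c != "Adage 1"))
```

Only the single faulting subprogram is moved, which is why this path is expected to be faster than reloading the total system.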

The  pertinent  software  errors  for  fault  detection  are 
shown in table 3. By diagnosing any of these errors, the faulting
subprogram  may  be  easily  determined  in  a  short  time.  The  subprograms 
may  then  be  relocated  by  the  system  reconfiguration  program. 
3. Interface Requirements
a. Data Paths

In  order  to  examine  the  exchange  of  data  in  a  detailed 
Table 2

Hardware Errors to Detect

1. Radar
   A. Azimuth
   B. Target Data (range, bearing)

2. Gyro
   A. Heading
   B. Pitch, Roll

3. Pit Log
   A. Speed

4. Memory
   A. Read, Write

5. I/O Channels
   A. Parity

6. CPU
   A. Incorrect subroutine answers

7. Missile Mount
   A. Launcher Angle
   B. Bearing
   C. Elevation
   D. Drift rate
Table 3

Software Error Detection

Ship Information
   A. Intercept Point
   B. Time to Fire
   C. Time to Go

Radar
   A. Target Data - Speed, Heading

Gyro
   A. Latitude
   B. Longitude
   C. Speed
manner, the actual electronic interface must be minutely analyzed.
Table 4 describes the interface equipment that is involved in the
system simulation program. Note the two different levels of core
memory accessibility that are detailed for the XDS-9300 computer
(accessible and inaccessible). This is very appropriate since the
modern modular computer system utilizes this technique of memory
protection. Any data in this protected area of memory must first be
accessed from inaccessible core and placed in accessible core.

Six levels of data accessibility are described in table
4 to account for all possible types of interfaces. Some computer
systems may only have one or two levels; the more complex systems
may include all six types. Systems of all complexities will be
represented by these levels of interfacing.

Table 5 lists the data paths and gives for each the access
time, hardware path, interface interference and typical types of data
that would be retrieved. The data paths are the same as shown in
table 4. The access times are the actual times required for both the
software and the hardware. The first time specified is a fixed time;
the second is the time for each additional access. The column
described as "hardware path" describes the actual path the data takes
through the system. The right arrow (→) shows the path of the data
from one hardware item to another as abbreviated in table 4. The
interface interference describes the interruption that the data access
causes to the other hardware and software systems. The typical types
of data retrieved are as described in table 1.
Table 4

INTERFACE EQUIPMENT COMPONENTS

    Component                               Abbreviation

1.  Adage memory (core)                     AM
2.  Adage I/O program                       AP
3.  Interface box (Adage to 9300)           IB
4.  9300 memory, accessible (0k-32k)        XMA
5.  9300 memory, inaccessible (0-8k)        XMT
6.  9300 I/O program                        XP
7.  Hybrid interface box                    HI
8.  Analog Computer (CI-5000)               AC


DEPTH OF DATA ACCESS (see table 5)
(Based on Monitor in Adage 1)

1.  Directly addressable                    Adage memory (core)
2.  Indirectly addressable                  Accessible core in 9300
3.  Programmatically accessible (level 1)   Inaccessible core in 9300
4.  Programmatically accessible (level 2)   Alternate Adage core
5.  Programmatically accessible (level 3)   Analog data (DAC-ADC)
6.  Programmatically accessible (level 4)   Analog data indirect
                                            (by SCAN system)
b. A Specific Example

Figure 5 describes a specific example of retrieving data
from a specific computer. Note the complex path that is necessary to
access this data, ten data transfers in all. This is to be expected
in large real-time systems and must be timed accurately. The data
must first be requested from the data sampling program located in
Adage 1. This request passes through the XDS-9300 computer to the
Adage 2 computer. The Adage 2 computer must then access the requested
data and pass it back to the Adage 1 computer via the XDS-9300
computer. While this data path is long, the majority of the time
required is for program initiations that must be set up (1.95 m sec).
Thereafter, only a short time (30 μ sec) is required for each
additional word retrieved, i.e., 100 words may be transferred in 5
m sec.
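
The timing of this example can be checked with a short calculation. The sketch below is in Python and reads the setup figure as 1.95 milliseconds and the per-word figure as 30 microseconds; the function name is hypothetical, and charging the per-word time to every word (including the first) is an assumption, since the text does not say whether the first word is covered by the setup time.

```python
SETUP_US = 1950.0     # program initiation time, read as 1.95 m sec
PER_WORD_US = 30.0    # time per word retrieved, 30 micro sec


def retrieval_time_us(words):
    """Estimated time to retrieve `words` words over the Adage 1 -> 9300 -> Adage 2 path."""
    if words < 1:
        raise ValueError("at least one word must be requested")
    return SETUP_US + PER_WORD_US * words


if __name__ == "__main__":
    # 100 words: 1950 + 100 * 30 = 4950 micro sec, i.e. about 5 m sec.
    print(retrieval_time_us(100))
```

Because the fixed setup time dominates, requesting many words in one transfer costs little more than requesting a few.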

4. Monitor Program Module

The  Monitor  Program  contains  all  items  for  Fault  Monitoring 
as  discussed.  This  program  is  divided  into  three  related  segments: 
(1)  Program  timing  analysis  and  priority  setup,  (2)  Data  sampling 
and  (3)  Human  operator  interface.  Each  segment  is  independent,  but 
relies  upon  completion  of  related  tasks.  Upon  completion  of  all  tasks, 
the  cycle  of  monitoring  the  total  system  is  complete. 

a.  Timing  Analysis 

This  subprogram  samples  the  system  usage  of  time  to  allow 
the  most  efficient  allocation  among  the  required  tasks.  Any  time 
that  remains  after  the  required  tasks  have  been  completed  is  surplus 
time.  In  most  systems,  this  surplus  time  is  not  utilized.  For  example, 
if the timing analysis program detects three milliseconds of time left
in a ten millisecond executive loop, it allocates as much of the
three milliseconds to Fault Monitoring as is possible. The allocation
of this time is used in three different modes:

(1) Light loading mode. When the computer system is
only lightly loaded (say 30%), many fault monitoring tasks may be
accomplished. With this amount of time available, many monitoring
tasks normally accomplished under maintenance down time may be loaded
by segments into the computer memory from the disc by the monitor
program. Since the probability of this lightly loaded condition
occurring for a reasonable amount of time is high, many time slots
allotted to the fault monitor may be utilized in loading fault
detection and diagnostic programs for later execution. Execution
of these programs in addition to those discussed below maintains the
computer system at its greatest reliability.

(2) Medium loading. When the computer system is at
moderate loading (say 60%), little program swapping of monitor routines
is allowed. All major system functions are monitored and routine
maintenance tests are performed only periodically, a section at a time.
By concentrating the monitoring function on detecting errors of major
system functions, the up time reliability is greatly enhanced by
ensuring system operation. When an error is detected, a quick system
reconfiguration reduces the Mean Time to Repair (MTTR) to near zero.

(3) Heavy loading. When a computer system is heavily
loaded, every millisecond is needed to maintain the system in operation.
This is system utilization of about 90 percent. With such little time
available, most real-time multi-processor systems accomplish no
software fault monitoring at all. The result is that the smallest error
can cause the total system to fail. This is a very bad mistake! It
is in this situation that on-line Fault Monitoring is needed the most.
By monitoring only the most critical data points and completing this
task over many time slices, an effective monitoring program can be
carried out even under heavy loading. Timing analysis is most
important when there is very little time available. The process of
dynamic time allocation can be shown to be most effective for the
heavily loaded system. When a serious permanent fault occurs,
more time may be utilized for diagnostic routines to accurately locate
the fault. With the imminent prospect of system failure, the locating
and correcting of the fault now has highest priority. Only very
perishable data need be saved so that an operable system may be
restored on system restart. Therefore, a computer system with adequate
fault monitoring will have greatly enhanced system reliability even
during critical periods of heavy loading.
b.  Data  Sampling 

The data that is needed for fault monitoring is sampled
through the data paths and stored efficiently in core or disc.
Redundant data is filtered and only data permutations are actually
stored. This process requires an intricate scheme for storing data
since the core space and the time available are both critical. For
example, the azimuth angular rotation of the missile mount is
considered to be continuous. Rather than store all data points over
several seconds (about 500 points), only one data point need be saved.
This would be the "old" azimuth angle and would be compared to the
newly acquired angle. The difference would then be compared against a
maximum allowed difference. Whenever this maximum difference was
exceeded, an error would be generated. Thus only three words ("old"
angle, "new" angle, and maximum difference) must be stored compared
to possibly 500.
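
The three-word continuity check described above can be sketched as follows. This is a minimal Python illustration, not the thesis code: the class and method names are hypothetical, and the wrap-around handling at 360 degrees is an added assumption that the text does not discuss.

```python
class ContinuityCheck:
    """Keeps only the "old" angle and the maximum allowed difference."""

    def __init__(self, initial_deg, max_diff_deg):
        self.old = initial_deg
        self.max_diff = max_diff_deg

    def update(self, new_deg):
        """Compare the new angle with the old one; True means an error."""
        diff = abs(new_deg - self.old) % 360.0
        diff = min(diff, 360.0 - diff)   # assumption: take the shorter way round
        self.old = new_deg               # the new angle becomes the old angle
        return diff > self.max_diff


if __name__ == "__main__":
    check = ContinuityCheck(initial_deg=10.0, max_diff_deg=5.0)
    print(check.update(12.0))   # small step, no error expected
    print(check.update(40.0))   # large jump, error expected
```

The storage cost stays at three words per monitored quantity regardless of how long the quantity is observed.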

All data samples needed for adequate fault monitoring are
grouped into sections; only those subprograms required are brought
into core and executed. For example, the data elements (table 1)
needed to be monitored for fault detection and the related extraction
times (from table 5) are shown below.

     Data                                       Access Time

 1.  Radar azimuth                              8.1 μ sec
 2.  Radar target range                         8.1 μ sec
 3.  Radar target bearing                       8.1 μ sec
 4.  Ship latitude                              8.1 μ sec
 5.  Ship longitude                             8.1 μ sec
 6.  Missile intercept point                    370 μ sec
 7.  Missile time to fire                       370 μ sec
 8.  Missile target destruction evaluation      370 μ sec
 9.  Missile mount bearing                      370 μ sec
10.  Missile mount elevation                    370 μ sec

Each data point can be retrieved individually or as a
group. The access time of an individual data item is different from
that of a group. The best method to choose is the method that results
in the smallest average access time per data element. For example, by
summing the individual access times for data items one through five, we
obtain 40.5 μ seconds. Table 5 shows that if more than one data
element of this type is accessed, the time of access is 15.6 μ seconds
per element. Therefore by accessing data items one through five as a
group, the required access time is 78.0 μ seconds. In this example, it
is more advantageous to access each data item individually than as a
group. Similarly summing the individual access times of items six
through ten, we obtain 1,850 μ seconds. Again using table 5, we find
that these same data elements accessed as a group require only 402 μ
seconds. In this example, the data access time of a group is much less
than that of the same items retrieved individually.
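
The comparison between individual and group access can be sketched as a small selection routine. The Python sketch below uses the figures quoted in the text (8.1 and 370 μ sec per item, 78.0 and 402 μ sec per group); the function name is hypothetical, and the group times are entered as given rather than derived, since table 5 is not reproduced here.

```python
def best_access(individual_us, group_us):
    """Pick the cheaper way to fetch a set of data items.

    individual_us -- list of access times when each item is fetched alone
    group_us      -- total access time when the items are fetched as a group
    Returns (method, total time in micro sec).
    """
    total_individual = sum(individual_us)
    if group_us < total_individual:
        return "group", group_us
    return "individual", total_individual


if __name__ == "__main__":
    print(best_access([8.1] * 5, 78.0))     # items 1-5: radar and ship data
    print(best_access([370.0] * 5, 402.0))  # items 6-10: missile data
```

With the quoted figures the routine chooses individual access for items one through five and group access for items six through ten, in agreement with the text.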
c.  Human  Operator  Interface 

The human operator interface module accepts the sampled
data and analyzes the data for faults. Upon detecting a fault, the
pertinent data is displayed to the operator as an alert. Requests from
the operator are input into this module and displayed in the proper
format. All human input actions, such as a light pen hit or a function
switch depression, are recognized by this program module and acted upon.

If reconfiguration is requested by the system monitor
operator, a system study is conducted by the reconfiguration program
module to analyze the current configuration. Then the program looks
up the entry in the reconfiguration table appropriate to the component
which has failed and displays the recommended reconfiguration for
operator approval (see fig. 6). If the operator approves this
reconfiguration, he presses a function switch labeled "accept" and
the program continues and executes the recommended reconfiguration.
If the system monitor operator disapproves of the recommended
reconfiguration, he may alter the display in an appropriate manner and
then order the computer to execute this new configuration.
5.  Combat  Information  Center  Program  (SHUIA) 

A simulation of a Ship Combat Weapon System was programmed
on the two Adage Graphic Terminals available at the Naval Postgraduate
School (see Programs A and B). Three systems were simulated: (1) Combat
Information Center, (2) Radar and (3) Ship Information.

The basic purpose of this simulation was to provide a model
on which to test the ideas presented in the preceding section for a
fault monitoring system. The simulation provides an actual model of
combat between a ship and an aircraft. A display of the position of
the ship and the airplane has been incorporated to allow visual
following of the action. The simulation is programmed for the
airplane to approach and attack the ship, firing missiles at the ship
when close enough. The ship in turn must detect the airplane, and
make the decisions of hostility, of time to fire and of target
destruction.

a.  Main  Control  and  Combat  Information  Display  (CICP) 
This  program  operates  in  one  Adage  Graphic  Display 
terminal (see Program A). The timing of the overall simulation is
controlled  in  this  module. 

The command and control system's purpose is to accept data
from the radar simulation and make a decision upon the identity of
the target. If it is identified as hostile, a "kill" order is sent
to the Fire Control Computer System Module. Since the time required
for the decision process has a Poisson Distribution, it is simulated
by a constant eight second delay plus a random delay from an
exponential random number generator with an expectation of four
seconds.
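
The decision delay described above can be reproduced with a standard exponential generator. This is a hedged Python sketch, not the Adage program: the function name is hypothetical, and Python's `random.expovariate` (whose argument is the rate, the reciprocal of the mean) stands in for whatever generator the original simulation used.

```python
import random

FIXED_DELAY_S = 8.0    # constant part of the decision time
MEAN_RANDOM_S = 4.0    # expectation of the exponential part


def decision_delay_s(rng):
    """Draw one simulated command-and-control decision time, in seconds."""
    return FIXED_DELAY_S + rng.expovariate(1.0 / MEAN_RANDOM_S)


if __name__ == "__main__":
    rng = random.Random(1)   # seeded only to make the run repeatable
    delays = [decision_delay_s(rng) for _ in range(10000)]
    print(min(delays), sum(delays) / len(delays))
```

Every drawn delay is at least eight seconds, and the long-run average approaches twelve seconds (eight fixed plus four expected).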

As the fire control module rotates the missile launcher,
the missile launcher displayed on the ship moves in synchronism with
the missile launcher simulation on the Comcor analog computer. After
lock-on to the target, the combat system "launches" a salvo of two
missiles. These are simulated on the Adage display as a pair of
bright dots, one after the other, originating from the missile launcher
and moving towards the target. If the missiles "hit" the target, the
target explodes into many bits, simulated by many dots randomly spaced.
If the plane launches a missile and "hits" the ship, the ship explodes
in the same way. All missiles are simulated as nuclear type.
b.  Radar  and  Ship  Information  Simulator  (RADAR) 

This program simulates the radar and ship information
systems on the ship. It is located in one Adage Graphic Terminal (see
Program B).

The  radar's  purpose  is  to  detect  all  incoming  targets  as 
soon  as  possible  and  relay  information  on  range  and  bearing  to  the 
Combat  System.  For  the  simulated  radar,  a  maximum  range  of  one  hundred 
miles was chosen. The delay between the time that the approaching
aircraft crosses the point of maximum range and the time that the
target data is actually transmitted has a uniform random distribution
with a maximum of eight seconds and a minimum of four seconds. Eight
seconds  was  chosen  after  considering  the  antenna  rotational  speed  and 
the  number  of  rotations  required  for  the  radar  operator  to  confirm 
an  actual  target. 




The  Ship  Information  subprogram  simulates  the  ship  as 
moving  on  a  steady  course  at  a  speed  of  thirty  knots.  This  information 
is  passed  to  the  CICP  program  and  is  used  to  move  the  simulated  ship 
display. 

The radar simulator includes a model of an airplane.
The airplane model moves at a speed of 3,600 knots. The airplane
starts on a course of 270° T. When it is within 100 miles of the
ship, it turns automatically to attack the ship. Manual override
controls  are  provided  to  control  the  course,  altitude  and  missile 
firing.  These  are  controlled  by  function  switches  adjacent  to  the 
display. 
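
The radar reporting delay and the aircraft speed given above determine how far the target has closed by the time its data is transmitted. The Python sketch below is illustrative only: it assumes the "miles" of radar range are nautical miles and that the aircraft flies directly toward the ship, neither of which is stated in the text, and the function names are hypothetical.

```python
import random

RADAR_MAX_RANGE_MI = 100.0   # maximum range of the simulated radar
AIRCRAFT_SPEED_KN = 3600.0   # 3,600 knots is one nautical mile per second


def detection_delay_s(rng):
    """Uniform delay between crossing maximum range and data transmission."""
    return rng.uniform(4.0, 8.0)


def range_at_report_mi(delay_s):
    """Range when target data is sent (assumes a head-on approach)."""
    closing_mi_per_s = AIRCRAFT_SPEED_KN / 3600.0
    return RADAR_MAX_RANGE_MI - closing_mi_per_s * delay_s


if __name__ == "__main__":
    delay = detection_delay_s(random.Random())
    print(delay, range_at_report_mi(delay))
```

Under these assumptions the first report arrives with the target between 92 and 96 miles from the ship.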

6.  Fire  Control  System  Program  (Missile  Mount) 

A  simulation  of  a  Digital  Fire  Control  System  and  of  a  Missile 
launching  Mount  was  programmed  on  the  Xerox  Data  Systems  (XDS)  9300 
digital  computer  and  the  Comcor  CI-5000  analog  computer.  The  Digital 
Fire  Control  System  was  simulated  on  the  XDS-9300  computer  (see  Program 
C).  The  Missile  Launching  Mount  was  simulated  on  the  CI-5000  analog 
computer  (see  Program  D). 

The  basic  purpose  of  these  simulations  was  to  provide  a  Digital 
Fire  Control  System  and  a  Missile  Mount  to  interact  with  the  simulated  Ship 
Combat  Weapon  System  on  the  Adage  Graphic  Terminals.  The  Digital  Fire 
Control System was written in FORTRAN IV on the XDS-9300 computer and
uses  its  hybrid  capabilities  to  communicate  with  the  CI-5000  analog 
computer.  The  Missile  Mount  is  simulated  to  act  like  a  real  mechanical- 
electrical  missile  mount.  Upon  assignment  of  a  target  azimuth,  the 
missile  mount  moves  as  a  missile  mount  aboard  ship  would  move. 
a. Digital Fire Control System

The  digital  fire  control  system  accepts  target  information 
from  the  Combat  System  and  converts  this  rectangular  coordinate  data 
to  polar  coordinates  for  the  fire  control  missile  mount.  It  must  then 
order  the  missile  mount  to  move  from  its  present  azimuth  to  the  target 
azimuth. In a total analog system, this would be all that would be
required; the analog feedback system would effect the required movement.
In  a  digital  system  many  improvements  may  be  gained.  Overshoot  and 
time  to  rotate  can  be  minimized  under  digital  control.  The  digital 
control  utilizes  a  modified  "Bang-Bang"  approach  that  uses  six  phases. 
Each  phase  implements  a  separate  portion  of  the  task  of  moving  the 
missile  mount.  With  appropriate  programming,  the  missile  mount  moves 
at  the  fastest  speed  possible  with  the  smallest  overshoot.  Digital 
Control  optimizes  the  control  of  this  simulated  large  and  massive 
Missile  Mount  as  it  does  in  the  real  case. 

b.  Missile  Launcher  Mount 

The missile mount was simulated on the CI-5000 analog
computer using hybrid computer techniques. The simulated mount consists
of an amplidyne controlled generator that drives a large motor which in
turn drives a 100:1 gear train connected to the missile mount. The
amplidyne requires 53 volts per field ampere and in turn controls the
field coil of the generator that can produce 25 amperes at 440 volts.
The generator drives a 200 horsepower motor at speeds up to 1150 RPM.
The mount weighs 28 tons and may rotate at a rate of up to
one radian per second. The resulting transfer and analog computer
equations of the fourth order system are:
Where    g - generator    m - motor
         l - launcher     f - field
         a - armature     K - constant

Both azimuth and elevation controls are implemented on the
analog computer and have been verified to be similar to a shipboard
missile mount. The analog simulation adds realism to the combat
system and allows actual hardware items to be monitored by the fault
monitoring system.
V. RECOMMENDED TECHNIQUES

By  analyzing  the  process  used  in  fault  monitoring  in  the  simulated 
Ship  Combat  Weapon  System,  the  overall  technique  should  now  be  clear. 
By  applying  the  following  techniques,  a  Real-Time  Multi-Computer 
Monitoring  system  may  be  designed  to  operate  effectively  even  under 
heavy  loading  conditions. 

A.   DYNAMIC  TIME  ALLOCATION 

The  hardware  and  software  data  transfer  rate  between  all  hardware 
components  must  be  accurately  determined  by  an  interface  timing  study. 
This  may  be  accomplished  by  writing  a  simple  program  loop  passing  data 
between the components. All critical data (critical to the hardware and
software partitions) must be determined and listed as either hardware or
software accessible data needed for data evaluation. From the interface
timing study, the time required to access this data may be determined.
From this data list, groups of data should be determined so as to best
fit  the  minimum  time  allotted  to  fault  detection  under  heavily  loaded 
conditions.   This  grouping  must  be  done  in  conjunction  with  the  study 
of  Partitioning.  When  the  final  list  is  completed  and  all  groupings 
made,  this  data  becomes  the  basis  of  the  data  sampling  program  module. 

The  timing  analysis  program  module  works  directly  with  the  data 
sampling  program.   By  analyzing  the  system  resources,  time  allocation 
may  be  distributed  to  a  number  of  data  sampling  group  subprograms  and 
data  analysis  programs.   For  example,  two  milliseconds  may  be  allocated 
to  monitor  all  critical  data  points  of  the  radar  system. 
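
One way to distribute the allocated time among the data sampling groups is a simple rotating queue, sketched below in Python. This is an assumption about how the allocation could work, not the method of the thesis: the function name, the queue discipline and the example costs (built from the access times quoted earlier) are all illustrative.

```python
from collections import deque


def run_slice(queue, budget_us):
    """Run queued sampling groups in order until the next one no longer fits.

    queue     -- deque of (group name, cost in micro sec); rotated in place
    budget_us -- time allocated to fault monitoring in this time slice
    Returns the names of the groups sampled in this slice.
    """
    ran = []
    for _ in range(len(queue)):          # each group at most once per slice
        name, cost_us = queue[0]
        if cost_us > budget_us:
            break                        # wait for a slice with more time
        queue.rotate(-1)                 # move the group to the back
        budget_us -= cost_us
        ran.append(name)
    return ran


if __name__ == "__main__":
    groups = deque([("radar", 24.3), ("missile", 402.0), ("ship", 16.2)])
    print(run_slice(groups, 100.0))   # heavily loaded slice
    print(run_slice(groups, 500.0))   # lightly loaded slice
```

A group that never fits would block the groups behind it in this sketch, so in practice each group's cost must be kept below the minimum time allotted under heavy loading, as the grouping rule above requires.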
B. PARTITIONING

The multi-computer system must be partitioned into hardware and
software logical subelements. By determining the degree of
reconfiguration possible, e.g., the number of CPU's, the degree of
partitioning becomes known in part because partitioning and
reconfiguration are interrelated. Both must be determined together in
order to optimize system resources. If two partitioned elements may not
be used interchangeably for reconfiguration, then the partitioning is
too small. Partitioning must also consider what data inside a proposed
subelement is critical. Normally each logical element of a
multi-computer system has a number of data elements that can be used to
determine when the logical element has failed. These data elements are
the critical data points of this partitioned logical subelement.

If a computer program is critical to the operation of a
multi-computer system and has no replacement, then a simulation of the
program should be included in the system. A simulation of a program
may be a smaller version of the replaced program or it may be a dummy
program that allows the total system to remain operational at a
reduced level. Then a software fault in this program that cannot be
corrected by the relocation of the program may be corrected temporarily
by the use of the simulation. While the simulation is maintaining
the system at a degraded level, the error in the program may be
corrected and the program then reinstated into the system.

After  the  system  has  been  satisfactorily  partitioned,  the  subsystem 
elements become the system status list. The logical connections (I/O)
of  these  partitions  are  then  also  fixed  and  inserted  into  tables  in  the 
reconfiguration  program  module. 
C. FAULT RECOGNITION

Since the critical data has been determined under the study of
Dynamic Time Allocation, only the method of fault recognition remains.
Software errors are only recognized by software routines, but hardware
faults are best detected by a combination of hardware and software.
If  hardware  devices  are  present  to  detect  the  faults,  then  they  should 
be  used  in  preference  to  software  routines  as  hardware  detection  is 
much  faster.  If  some  faults  require  an  exorbitant  amount  of  time  to 
be  recognized  in  software,  then  the  use  of  special  purpose  hardware 
registers  and  fault  detectors  should  be  studied.  Special  purpose 
hardware  fault  detectors  operate  at  a  much  higher  speed  but  may  be 
more  expensive  than  software  subroutines. 

Permanent  faults  and  errors  may  be  detected  and  analyzed  by  hardware 
or  software,  but  transient  faults  and  errors  may  only  be  economically 
analyzed  by  software  routines.  Provisions  for  detecting  and  analyzing 
transient  errors  must  be  included  in  the  system. 

D.  DIAGNOSTIC  ROUTINES 

Diagnostic  routines  become  smaller  in  fault  monitoring  programs  that 
recognize  faults  at  the  subsystem  level.  Normally  a  fault  may  be  due 
to  any  one  of  hundreds  of  likely  components.  All  components  must  be 
diagnosed to determine which component is at fault. Since any
subsystem may only have from three to five critical data points (for
example),  the  diagnostic  routine  necessary  to  locate  a  subsystem  error 
may  be  simpler  in  design. 

Normal  diagnostic  routines,  used  to  diagnose  computers  and  special 
hardware  devices,  are  utilized  in  this  fault  monitoring  system  also. 
Since most of these routines require run times of minutes, they must
be segmented into logical time elements that may be called by the
system  dynamic  time  allocation  routine  and  executed  when  the  computer 
is  lightly  loaded.   In  this  way  complete  diagnostic  analysis  of  the 
total  multi-computer  system  can  be  accomplished. 

E.  SYSTEM  RECONFIGURATION  AND  PRESENTATION 

Items A-D above compile all the necessary data for system
reconfiguration and presentation to the System Monitor Operator. The
system status list that was generated from the study of Partitioning
may now be used to determine when and how a system might be reconfigured.
A set of possible configuration lists, or even a program that computes
an acceptable reconfiguration, is available for the use of the
reconfiguration program. When requested, the program studies the
submodule that has failed to see if it is on the current subsystem
active list (see fig. 7 for a sample list). If it is, it removes it and
places it on the non-active list. The program searches the possible
configuration lists until it finds a match with the current subsystem
active list. If a match is not found, it notifies the system monitor
operator and halts. Before continuing, if the semi-automatic mode is
set, the reconfiguration is presented to the operator for approval. In
the automatic mode, this step is omitted. By resetting the logical
interface list, any faulty input/output ports are bypassed. If needed, a
program may be relocated (by reloading it). If the fault is programmatic,
a suitable simulation program may be loaded to replace the faulty one.
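
The search through the possible configuration lists can be sketched as follows. This Python sketch rests on stated assumptions: a "match" is taken to mean a configuration whose required components are all still on the active list, the `approve` callback stands in for the operator's "accept" switch in semi-automatic mode, and all names and example lists are hypothetical.

```python
def reconfigure(active, failed, configurations, approve=None):
    """Return the configuration to execute, or None if none is usable.

    active         -- component names on the current subsystem active list
    failed         -- the submodule that has failed
    configurations -- possible configuration lists, in order of preference
    approve        -- operator approval callback (semi-automatic mode);
                      None selects the automatic mode
    """
    remaining = set(active)
    remaining.discard(failed)            # move the failure to the non-active list
    for config in configurations:
        if set(config) <= remaining:     # every required component is active
            if approve is None or approve(config):
                return config
            return None                  # operator rejected the recommendation
    return None                          # no match: notify the operator and halt


if __name__ == "__main__":
    active = ["CPU #1", "CORE 1a", "CORE 1b", "CPU #2", "CORE 2a"]
    configs = [["CPU #1", "CORE 1a", "CORE 1b"], ["CPU #2", "CORE 2a"]]
    print(reconfigure(active, "CORE 1b", configs))
```

Ordering the configuration lists by preference lets the first match serve as the recommended reconfiguration; the relocation and simulation-loading steps would follow once a configuration has been returned.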

The  type  of  presentation  displayed  to  the  monitor  operator  depends 
upon  the  equipment  available  and  the  system  being  monitored.  Since 
EXAMPLE OF A SUBSYSTEM ACTIVE LIST

ACTIVE    COMPONENT (see fig. 5)

yes       RADAR
yes       SHIP DATA
yes       MONITOR
yes       CPU #1
yes       CORE 1a
yes       CORE 1b
yes       CORE 1c
no        CORE 1d
yes       DISPLAY 1
yes       INTERFACE 1
yes       INPUT DATA
yes       OUTPUT DATA
yes       CPU #2
yes       CORE 2a
yes       CORE 2b
no        CORE 2c
no        CORE 2d
no        DISPLAY 2
yes       INTERFACE 2
yes       DAC
yes       ADC
etc.      etc.

Figure 7
the data to be presented is voluminous and time critical, any mechanical
device would be too slow. Some type of Cathode Ray Tube (CRT) with
function switches and perhaps a typewriter input is needed. Then the
fault, its location and recommended solutions can be displayed
simultaneously. The recommended reconfiguration presentation can be
either displayed as a logic diagram showing the reconfigured components
and their links (see fig. 6) or the two lists (old and new
reconfiguration lists) can be displayed side by side. Because of the
rapid visual assimilation by the operator of the displayed data, rapid
decisions may be made.